Abstract

This paper describes new mapping algorithms for domain-oriented data-parallel computations,
where the workload is distributed irregularly throughout the domain, but exhibits localized
communication patterns. We consider the problem of partitioning the domain for parallel
processing in such a way that the workload on the most heavily loaded processor is minimized,
subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are
useful on architectures that have a fast local mesh network and a relatively slower global
network; these partitions heuristically attempt to maximize the fraction of communication carried
by the local network. This paper provides an improved algorithm for finding the optimal
partition in one dimension, presents new algorithms for partitioning in two dimensions, and shows
that optimal partitioning in three dimensions is NP-complete. We discuss our application of these
algorithms to real problems.

1 Introduction


One  of  the  most  important  problems  one  must  solve  in  order  to  use  parallel  computers  is  the 
mapping  of  the  workload  onto  the  architecture.  This  problem  has  attracted  a  great  deal  of  attention 
in  the  literature,  leading  to  a  number  of  problem  formulations.  One  often  views  the  computation 
in  terms  of  a  graph,  where  nodes  represent  computations  and  edges  represent  communication; 
for  example,  see  [2].  Mapping  means  assigning  each  node  to  a  processor;  this  is  equivalent  to 
partitioning  the  nodes  of  the  graph,  with  the  tacit  understanding  that  nodes  in  a  common  partition 
set  are  assigned  to  the  same  processor.  We  will  use  the  terms  interchangeably.  A  common  mapping 
problem  formulation  views  the  architecture  as  a  graph  whose  nodes  are  processors  and  whose 
edges identify processors able to communicate directly. The dilation of a computation graph edge
(u, v) is the minimum distance (in the processor graph) between the processors to which u and
v are respectively assigned. The dilation of the graph itself is the maximum dilation among all
computation  graph  edges.  Dilation  is  a  measure  of  how  well  the  mapping  preserves  locality  between 
nodes  in  the  mapped  computation  graph.  Results  concerning  the  minimization  of  dilation  can  be 
found  in  [4,  9,  16,  21],  and  their  references. 

Another formulation (the one we study) directly models execution time of a data parallel
computation as a function of the chosen mapping, and attempts to find a mapping that minimizes the
execution time. Workload may again be represented as a graph, with edges representing data
communication, e.g., the stencils used in some partial differential equation solvers [18]. In its simplest
form each node is assumed to have unit execution weight; more general forms permit nodes to have
individual weights. Nodes are mapped to processors in such a way that each processor’s sum of node
weights is approximately the same; for example, see [1, 3, 19]. A rigorous treatment of partitioning
three dimensional finite-difference and finite-element domains is found in [23]; unlike our treatment
here, the shape of the subdomains is not a consideration. Minimization of communication costs
subject to load-balancing constraints is considered in [6]; other formulations use simulated
annealing or neural networks to minimize an “energy” function that heuristically quantifies the cost of
the partition [7]. Other interesting formulations consider mapping highly structured computations
onto pipelined multiprocessors [14], and mapping systolic algorithms onto hypercubes [10].

This  paper  considers  the  Rectilinear  Partitioning  Problem  (RPP):  find  an  optimal  rectilinear 
partition  of  a  domain  containing  irregularly  weighted  workload.  One  may  view  the  workload  as 
being  concentrated  at  discrete  coordinates  within  the  domain;  alternatively  one  may  represent  the 
domain’s  workload  in  a  workload  matrix,  each  of  whose  elements  represents  all  the  workload  within 
a  rectangular  fixed-sized  region  of  the  domain.  A  domain  with  irregularly  distributed  discrete 
workload  can  always  be  transformed  into  a  workload  matrix;  our  problem  formulation  will  thus 
be  stated  in  terms  of  the  matrix  view.  A  rectilinear  partition  of  a  workload  matrix  requires  each 
partition  element  to  be  an  appropriately  dimensioned  “rectangle”  whose  dimensions  exactly  match 
that  of  each  neighbor  at  each  face.  “Rectangles”  in  one  dimension  are  intervals;  a  “rectangle”  in 
three  dimensions  is  a  rectangular  solid.  The  cost  of  a  rectangle  is  the  sum  of  all  workload  within 
its boundaries; the cost of the partition is the maximum cost among all its rectangles. The idea
is to let the weight of a rectangle be the time required to execute its implicitly assigned workload
on a processor. The maximum such weight defines the computation’s finishing time (or its inverse
defines the processing rate) when each rectangle is assigned to a different processor. Figure 1
illustrates a rectilinear partition in two dimensions.

Figure 1: Two dimensional 4x4 rectilinear partition of a workload matrix representing a
two-dimensional domain

RPP arises when executing physically-oriented numerical computations on certain types of
mesh-connected multiprocessors. For example, some computations are based on grids that discretize
a one, two or three dimensional field for numerical solution. An n x m workload matrix can be
constructed by pre-aggregating adjacent grid points to create a rectangular structure; alternatively,
one may choose n and m so large that at most one grid point is represented in one cell of the n x m
domain. RPP is motivated by parallel architectures that support very fast “local” communication
over a mesh network, and significantly slower “global” communication. The Connection Machine
provides  an  example;  the  speed  differential  between  communication  using  the  local  network  and  the 
global  router  is  roughly  a  factor  of  six  on  problems  with  regular  communication  patterns  [22].  It  can 
be  worse  if  the  global  network  suffers  significant  contention.  Since  communication  requirements 
in  domain-oriented  computations  are  often  localized  in  space,  rectilinear  partitions  will  tend  to 
maximize  the  volume  of  communication  which  can  use  the  fast  mesh  network. 

It is possible to partition a domain with irregularly distributed discrete workload into
quadrilaterals whose faces match exactly, as do rectilinear partitions. Such partitions will have the
desirable locality of communication properties we seek. However, rectilinear partitions have the
advantage of being expressed simply. One benefit is that one can always compute the processor id of
a point (x, y) with which one wishes to communicate: a binary search on the list of cuts in the X
dimension establishes the X processor coordinate of (x, y)’s processor, and another search establishes
the Y processor coordinate. Simplicity of expression also implies simplicity of construction. There is
some advantage to choosing N + M - 2 cuts instead of choosing (N - 1) x (M - 1) cuts. Finally, the
mathematical regularity of rectilinear partitions makes them interesting objects in their own right.
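
The processor lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `owner_of` is hypothetical, cells are indexed from 1, and each cut list includes the boundary values 0 and the domain size.

```python
from bisect import bisect_left


def owner_of(x, y, row_cuts, col_cuts):
    """Return the 1-based processor coordinates (i, j) that own cell (x, y).

    Assumed convention: row_cuts = (r_0, ..., r_N) and col_cuts = (c_0, ..., c_M)
    are sorted with r_0 = c_0 = 0, and processor (i, j) owns the cells with
    r_{i-1} < x <= r_i and c_{j-1} < y <= c_j.
    """
    if not (1 <= x <= row_cuts[-1] and 1 <= y <= col_cuts[-1]):
        raise ValueError("cell lies outside the domain")
    # bisect_left finds the smallest index i with cuts[i] >= coordinate,
    # which is exactly the block whose upper cut covers the coordinate.
    i = bisect_left(row_cuts, x)
    j = bisect_left(col_cuts, y)
    return i, j


if __name__ == "__main__":
    rows = [0, 2, 5, 8]   # three processor rows
    cols = [0, 4, 6]      # two processor columns
    print(owner_of(3, 5, rows, cols))
```

Each lookup costs one binary search per dimension, so the full partition never needs to be stored as a per-cell table.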

We consider partitioning in one, two, and three dimensions. It should be noted that a three
dimensional domain can always be partitioned for a two or one dimensional processor array; likewise,
a two dimensional domain can be partitioned for a one dimensional processor array. Thus, the
partitioning dimension describes the communication topology of the target architecture. A distinction
can be made between processor meshes that directly connect diagonally adjacent processors, and
those that don't. The algorithms we develop here are primarily concerned with the latter. They
may be used on more fully connected meshes, but do not attempt to take advantage of the extra
connectivity.

RPP  is  a  challenging  problem,  as  it  is  similar  to  certain  NP-complete  problems,  but  is  also 
similar  to  problems  with  polynomial  complexity.  It  is  already  known  that  the  one-dimensional 
problem can be solved in polynomial time [3, 5, 17]. Our first result is to improve upon the best
published 1D algorithm to date, for the case when the computation’s size greatly exceeds the
number of processors. Next we consider RPP in two dimensions. We show that if the partition in
one dimension (say x) is fixed, then the optimal partitioning in the other dimension can be found in
polynomial  time.  This  result  has  at  least  two  applications.  First,  it  can  be  used  to  find  the  optimal 
2D  rectilinear  partition;  one  simply  generates  all  partitions  in  one  dimension,  and  finds,  for  each, 
the  optimal  partitioning  in  the  other.  While  this  procedure  is  correct,  it  has  unreasonably  high 
complexity.  For  this  reason  we  develop  a  2D  iterative  refinement  heuristic  based  on  our  ability  to 
find  conditionally  optimal  partitions.  During  each  iteration,  one  finds  the  optimal  partitioning  in 
one  dimension,  given  a  fixed  partition  in  the  other.  The  next  iteration  uses  the  solution  just  found  as 
the  fixed  partition,  and  optimally  solves  in  the  other  dimension.  One  then  iterates  until  the  partition 
stops  changing.  The  procedure  is  guaranteed  to  converge  monotonically  to  a  local  minimum  in  the 
solution  space.  We  discuss  application  of  this  technique  to  two-dimensional  problems  arising  in  fluid 
dynamics  calculations,  and  compare  the  quality  of  solutions  produced  by  the  heuristic  with  solutions 
produced by algorithms having fewer restrictions on the partitioning, e.g., binary dissection. We
find  that  rectilinear  partitions  can  achieve  better  performance  than  the  other  methods,  especially 
when  the  grid  edges  are  oriented  in  two  orthogonal  directions,  or  when  global  communication  is 
an  order  of  magnitude  slower  than  local  communication.  Our  last  contribution  is  to  show  that  the 
problem  of  finding  an  optimal  rectilinear  partition  in  three  dimensions  is  NP-complete. 

The  remainder  of  this  paper  is  organized  as  follows.  In  Section  §2  we  introduce  some  notation 
and  develop  the  cost  function  we  wish  to  minimize.  In  Section  §3  we  give  an  improved  solution 
to  RPP  in  one  dimension.  Section  §4  examines  RPP  in  two  dimensions,  and  Section  §5  proves 
the  NP-completeness  of  finding  optimal  three-dimensional  partitions.  Section  §6  summarizes  our 
results. 

2  Preliminaries 

This  section  introduces  some  notation  used  throughout  the  paper.  The  discussion  to  follow  speaks 
of the two dimensional partitioning problem; the extension to three dimensions is immediate, and
the projection to one dimension simply involves dropping notational dependence on indices in one
dimension.

We define the partitioning problem as follows. Consider an n x m load matrix where each
entry $L_{i,j} > 0$ represents the cost of executing some workload. For example, we might create a load
matrix from a discretized domain by prepartitioning the domain into many rectangular work pieces,
and assign load value $L_{i,j}$ to work piece $w_{i,j}$, based on the number of grid points and edges defined
within $w_{i,j}$. In the limit, we may prepartition the domain so finely that a work piece represents at
most one grid point and its edges.

Our problem is to partition the load matrix for execution on an N x M array of processors,
as follows. A partition is defined to be a pair of vectors (R, C), where R is an ordered set of row
indices $R = (r_0, r_1, \ldots, r_N)$, C is an ordered set of column indices $C = (c_0, c_1, \ldots, c_M)$, and we
understand that $r_0 = c_0 = 0$, $r_N = n$, and $c_M = m$. Given (R, C), the execution load on processor
$P_{i,j}$ is the sum of the weights of all the work pieces $w_{x,y}$ with $r_{i-1} < x \le r_i$ and $c_{j-1} < y \le c_j$.
This is given by

$$X_{i,j}(R, C) = \sum_{x=r_{i-1}+1}^{r_i} \; \sum_{y=c_{j-1}+1}^{c_j} L_{x,y}.$$

We take the overall cost of the partition to be the maximal execution load assigned to any processor:

$$\pi(R, C) = \max_{\text{all } i \text{ and } j} \{ X_{i,j}(R, C) \}.$$

This cost is known as the bottleneck value for the partition. Our object is to find partition vectors
R and C that minimize the bottleneck value.
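
To make the cost function concrete, the following sketch evaluates the bottleneck value of a given partition. It is an illustration only: the function name `bottleneck_cost` and the use of a two-dimensional prefix-sum table are choices made for this example, and the cut vectors are assumed to include the boundary values 0 and n (or m).

```python
def bottleneck_cost(load, row_cuts, col_cuts):
    """Return the maximum block load of the rectilinear partition (R, C).

    load is an n x m list of lists; row_cuts = (r_0, ..., r_N) and
    col_cuts = (c_0, ..., c_M) with r_0 = c_0 = 0, r_N = n, c_M = m.
    """
    n, m = len(load), len(load[0])
    # S[x][y] holds the total load of the first x rows and first y columns.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            S[x][y] = (load[x - 1][y - 1] + S[x - 1][y]
                       + S[x][y - 1] - S[x - 1][y - 1])
    worst = 0
    for r0, r1 in zip(row_cuts, row_cuts[1:]):
        for c0, c1 in zip(col_cuts, col_cuts[1:]):
            # Load of the block with rows r0+1..r1 and columns c0+1..c1.
            block = S[r1][c1] - S[r0][c1] - S[r1][c0] + S[r0][c0]
            worst = max(worst, block)
    return worst


if __name__ == "__main__":
    L = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
    print(bottleneck_cost(L, [0, 1, 3], [0, 2, 3]))
```

After the O(nm) table is built, each block load is obtained with four lookups, so evaluating a partition costs O(NM) further work.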

We have chosen not to include explicit communication costs in this model. This is largely a
practical decision. The data communication inherent in a computational problem tends to be
proportional to execution costs. This means that by balancing the execution load we will have largely
balanced the communication load also, at least if the bandwidth of the network is high enough for
us to ignore contention. It is also true that the execution weights $L_{i,j}$ are only estimates to
begin with; it seems unlikely that a more complicated model will find significantly better partitions.
Finally,  we  are  assuming  that  rectilinear  partitioning  is  desirable  because  local  communication 
is  much  cheaper  than  global  communication.  If  we  can  ensure  that  the  partition  supports  local 
communication  we  will  have  gone  a  long  way  towards  minimizing  communication  overhead.  Our 
empirical  study  discussed  in  §4.5  bears  out  this  intuition. 

3  One  Dimensional  Partitioning 

RPP in one dimension has been extensively studied as the chains-on-chains partitioning problem
[3, 5, 11, 12, 17]: we are given a linear sequence of work pieces (called modules), and wish to
partition the sequence for execution on a linear array of processors. Until recently, the best
published algorithm found the optimal partitioning in $O(Mm \log m)$ time, where M is the number of
processors and m is the number of modules. This solution and those developed in [11] and [17] all
involve repeatedly calling a probe function. A recently discovered dynamic programming
formulation [5] reduces the complexity further to $O(Mm)$. The solution we present has a complexity of
$O(m + (M \log m)^2)$, which is better than $O(Mm)$ when $M = O(m / \log^2 m)$. This solution is also
based on a probe function, which we now discuss in more detail.

In one dimension we are to partition a chain of modules with weights $w_1, \ldots, w_m$
into M contiguous subchains. We use a function probe, which accepts a bottleneck constraint W
and determines whether any partition exists with bottleneck value $W'$, where $W' \le W$. Candidate
constraints all have the form $W_{i,j} = \sum_{k=i}^{j} w_k$ because we know that the optimal cost is defined by
the load we place on some processor. If we precompute the m sums $W_{1,j}$ ($j = 1, \ldots, m$), then any
candidate value $W_{i,j}$ can be generated in O(1) time using the relationship
$W_{i,j} = W_{1,j} - W_{1,i-1}$ (with $W_{1,0} = 0$).

Given bottleneck constraint W, probe attempts to construct a partition with a bottleneck
value no greater than W. The first processor must contain the first module; probe finds the
largest index $i_1$ such that $W_{1,i_1} \le W$, and assigns modules 1 through $i_1$ to the first processor.
Because the sums $W_{1,j}$ increase in j, $i_1$ can be found with a binary search. It follows that the first
module in the second processor is $w_{i_1+1}$. probe then loads the second processor with the longest
subchain beginning with $w_{i_1+1}$ that does not overload the processor. This process continues until
either all modules have been assigned, or the supply of processors is exhausted. In the former case
we know that a feasible partition with weight no greater than W exists. In the latter case we
know that this greedy approach does not produce a feasible partition. However, it has been shown
(and indeed is quite straightforward to see) that the greedy approach will find a solution with cost
no greater than W if any solution exists with cost no greater than W. Since the loading of each
processor requires only a binary search, the cost of one probe call is $O(M \log m)$.
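
A minimal sketch of the greedy probe follows, assuming non-negative module weights and a non-negative constraint. The names `prefix_sums` and `probe` and the 0-based prefix array are choices made for this example, not notation from the text.

```python
from bisect import bisect_right
from itertools import accumulate


def prefix_sums(weights):
    """prefix[j] is the total weight of modules 1..j (prefix[0] = 0)."""
    return [0] + list(accumulate(weights))


def probe(prefix, num_procs, bound):
    """Return True if the chain fits on num_procs processors with load <= bound."""
    m = len(prefix) - 1
    start = 0  # number of modules already assigned
    for _ in range(num_procs):
        # Largest index end with prefix[end] - prefix[start] <= bound.
        start = bisect_right(prefix, prefix[start] + bound, start) - 1
        if start == m:
            return True
    return False


if __name__ == "__main__":
    p = prefix_sums([2, 3, 4, 5, 6])
    print(probe(p, 2, 11), probe(p, 2, 10))
```

Each processor is loaded with one binary search over the prefix sums, which matches the O(M log m) cost stated above.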

All solutions based on probe search the space of bottleneck constraints for the minimal one, say
$W_{opt}$, such that probe($W_{opt}$) = true. The partition probe generates given bottleneck constraint
$W_{opt}$ is optimal. The solution in [17] examines no more than 4m candidate constraints, which gives
the 1D partitioning problem an overall time complexity of $O(Mm \log m)$. As argued by Iqbal [11],
another easy way to probe is to compute the sum of all workload in the domain, say Z, and
choose a discretization length $\epsilon$. One may then conceptually discretize the interval [0, Z] into
$Z/\epsilon$ constraints, and use a binary search to find the minimum feasible constraint. This approach has
a complexity of $O(M \log(Z/\epsilon) \log m)$, although the cost of a partition it finds is guaranteed only
to be within $\epsilon$ of optimal. The only disadvantage of this method occurs when $\log(Z/\epsilon)$ is large
relative to m, in which case one may choose to search more cleverly. Towards this end we next
develop the paper’s first contribution, a searching technique that finds the optimal partition after
only $O(M \log m)$ probe calls.¹

¹ This result was originally observed by Iqbal (private communication). We present an independently
discovered proof (and algorithm) which easily extends to a 2D problem.

Let $W_{opt}$ be the minimal constraint for which probe returns value true. The new search
strategy exploits the following structure of an optimal solution constructed by probe. Suppose
processor F is the first processor assigned a load whose weight is exactly $W_{opt}$. The loads on all
processors 1 through F - 1 must be strictly less than $W_{opt}$, and hence their loads are not feasible
bottleneck constraints. However, the greedy construction process ensures that the load on each
processor up to F is as large as possible. For example, processor 1 is loaded with the longest
subchain, beginning with $w_1$, whose weight does not exceed $W_{opt}$. For any $W < W_{opt}$ we have
probe(W) = false; let $i_1$ be the largest index such that probe($W_{1,i_1}$) = false. Consider the
relationships

$$W_{1,i_1} < W_{opt} \le W_{1,i_1+1}.$$

If $W_{opt} < W_{1,i_1+1}$ (i.e., if 1 < F) then modules 1 through $i_1$ will be assigned to processor 1, otherwise
module $i_1 + 1$ will also be assigned to 1. Supposing that 1 < F, the subchain assigned to processor
2 begins with module $w_{i_1+1}$; define $i_2$ to be the largest index for which probe($W_{i_1+1,i_2}$) = false.
Under the greedy assignment, processor 2’s last module is either $w_{i_2}$ or $w_{i_2+1}$, depending on whether
F = 2. We may carry out this process for each processor: given $i_j$, $i_{j+1}$ is the largest index for
which probe($W_{i_j+1,i_{j+1}}$) = false. For each $i_j$ define $\omega_j = W_{i_{j-1}+1,\,i_j+1}$, with $i_0 = 0$. From the
discussion above it is apparent that when j < F, $i_j$ is the last module assigned to processor j under
the optimal greedy partition. $\omega_j$ is the smallest feasible constraint arising from any subchain beginning
with module $w_{i_{j-1}+1}$. Therefore, $W_{opt} = \omega_F$. Furthermore, each $\omega_j$ for j > F is a feasible constraint,
and hence must be at least as large as $W_{opt}$. This proves the following lemma.

Lemma 1 Let $W_{opt}$ be the minimal feasible bottleneck weight. Then $W_{opt} = \min_{1 \le j \le M} \{\omega_j\}$.

An important point is that the definitions of the $i_j$’s and $\omega_j$’s in no way depend on knowledge
of either F or $W_{opt}$. We may discover $W_{opt}$ by generating each constraint $\omega_j$ and choosing the
least. In order to find $i_1$ (and hence $\omega_1$) we need to search the space of all weights having the form
$W_{1,j}$. As we have already seen, this space can be searched with only $O(\log m)$ calls to probe. Each
probe call costs $O(M \log m)$; the cost of finding $\omega_1$ is thus $O(M \log^2 m)$. Similarly, given $i_1$, we find
$i_2$ using a binary search over all weights of the form $W_{i_1+1,j}$, and so on. As there are M such $\omega_j$’s
to compute, the overall cost of the computation is $O(m + (M \log m)^2)$, where an obligatory $O(m)$
cost is added to account for preprocessing costs. This complexity is better than $O(Mm)$ whenever
$M = O(m / \log^2 m)$, showing that the strategy is most useful when there are many modules to
be processed relative to processors. This is exactly the situation we face when partitioning large
numerical problems. One of the more useful applications of the new algorithm will be as part of
an approach for solving two dimensional problems, our next topic.
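
The search strategy of Lemma 1 can be sketched as follows. This is an illustrative reading of the procedure rather than a reference implementation: the names `probe` and `optimal_bottleneck` are hypothetical, weights are assumed non-negative, and a candidate is simply skipped when no feasible subchain begins at the current start position.

```python
from bisect import bisect_right
from itertools import accumulate


def probe(prefix, num_procs, bound):
    """Greedy feasibility test: can the chain be split with every load <= bound?"""
    m = len(prefix) - 1
    start = 0
    for _ in range(num_procs):
        start = bisect_right(prefix, prefix[start] + bound, start) - 1
        if start == m:
            return True
    return False


def optimal_bottleneck(weights, num_procs):
    """Minimum bottleneck value, found with O(M log m) probe calls."""
    prefix = [0] + list(accumulate(weights))
    m = len(weights)
    best = None
    start = 0  # plays the role of i_{j-1}
    for _ in range(num_procs):
        if start == m:
            break
        # Binary search for the shortest subchain start+1..end whose
        # weight is a feasible constraint; end == m + 1 means none exists.
        lo, hi = start + 1, m + 1
        while lo < hi:
            mid = (lo + hi) // 2
            if probe(prefix, num_procs, prefix[mid] - prefix[start]):
                hi = mid
            else:
                lo = mid + 1
        if lo <= m:
            candidate = prefix[lo] - prefix[start]  # the j-th candidate constraint
            best = candidate if best is None else min(best, candidate)
        start = lo - 1  # largest end whose subchain weight is infeasible
    return best


if __name__ == "__main__":
    print(optimal_bottleneck([2, 3, 4, 5, 6], 2))
```

Every candidate recorded is feasible by construction, so the minimum cannot fall below the optimum; the argument preceding Lemma 1 shows that one of the candidates attains it.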

4  Two  Dimensional  Partitioning 

Next  we  turn  to  partitioning  in  two  dimensions.  Our  discussion  has  three  parts.  First  we  provide 
some contrast by discussing a closely related 2D partitioning problem which is NP-complete. We
then return to our original 2D problem, and describe an algorithm that takes a given fixed column
(alternately, row) partition, and finds the optimal partitioning of the rows (alternately, columns) in
polynomial time. This result can be used to find an optimal 2D partition, albeit with exponential
complexity when N and M are problem parameters. We describe a heuristic with polynomial-time
complexity that finds a local minimum in the solution space. Finally, we discuss our experience
with this algorithm on large irregular grids typical of those used to solve fluid flow problems.

4.1 MLAA Problem

Consider  a  two-dimensional  n  x  m  load  matrix  representing  an  n-stage  computation,  as  follows. 
Each column represents some module; the weight $L_{i,j}$ represents the computational requirement
of module j during “stage” i. The columns are to be partitioned into contiguous groups and
mapped  onto  a  linear  array  of  processors.  In  this  respect  the  problem  is  one-dimensional;  however, 
the  objective  function  is  based  on  both  matrix  dimensions,  as  we  will  see.  We  assume  that  the 
computation  requires  global  synchronization  between  stages.  The  same  partitioning  of  modules 
is  applied  to  all  stages.  Thus,  a  partitioning  that  is  good  for  one  stage  may  create  imbalance  in 
another. The execution time of the ith stage is taken to be that of the most heavily loaded processor
during the ith stage, the stage’s bottleneck value. The overall execution time is then the sum of
bottleneck values from all stages. The problem of finding the optimal partitioning of columns is
known as the Multistage Linear Array Assignment (MLAA) problem [13]. The MLAA problem has
been  shown  to  be  NP-complete.  Solutions  with  polynomial  complexity  are  known  if  the  number  of 
stages  is  constant. 

The  MLAA  problem  is  an  interesting  point  of  reference  for  the  two-dimensional  partitioning 
problem,  for,  by  changing  the  objective  function  slightly,  we  obtain  a  problem  related  to  two- 
dimensional  partitioning  that  has  low  polynomial  complexity.  Suppose  we  seek  a  partitioning  that 
minimizes  the  maximum  of  the  stage  bottleneck  weights,  rather  than  their  sum.  This  problem  is 
equivalent  to  that  of  finding  the  optimal  two-dimensional  rectilinear  partitioning,  conditioned  on 
the  row  (alternatively,  column)  partitioning  being  fixed.  For  example,  suppose  that  row  partition 
R is given for a two dimensional load matrix. We know then that all work pieces lying in a given
workload column y between workload row indexes $r_{i-1} + 1$ and $r_i$ will be assigned to the same
processor, in the ith row of processors. We may therefore aggregate them into a single super-piece
with weight

$$A_{i,y} = \sum_{x=r_{i-1}+1}^{r_i} L_{x,y}.$$

This aggregation creates an N x m weight matrix A. Any subsequent partitioning of the columns
into  M  contiguous  groups  completes  a  rectilinear  partitioning.  Like  the  MLAA  problem  we  can 
compute  the  weight  of  the  most  heavily  loaded  processor  in  each  row,  and  call  this  the  row’s 
bottleneck  weight.  The  maximum  bottleneck  weight  is  then  the  maximum  execution  weight  among 
all  processors.  However,  unlike  the  MLAA  problem,  the  optimal  column  partition  can  be  found 
quickly,  as  we  now  show. 
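
The aggregation into super-pieces is straightforward to sketch. The function name `aggregate_rows` is hypothetical, and the row cuts are assumed to include the boundary values 0 and n.

```python
def aggregate_rows(load, row_cuts):
    """Collapse an n x m load matrix into an N x m matrix of super-pieces.

    Row block i covers workload rows r_{i-1}+1 .. r_i (1-based), i.e.
    Python row indices r_{i-1} .. r_i - 1.
    """
    m = len(load[0])
    aggregated = []
    for r0, r1 in zip(row_cuts, row_cuts[1:]):
        aggregated.append(
            [sum(load[x][y] for x in range(r0, r1)) for y in range(m)]
        )
    return aggregated


if __name__ == "__main__":
    L = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
    print(aggregate_rows(L, [0, 1, 3]))
```

Each of the resulting N rows can then be treated as a one-dimensional chain when the columns are partitioned.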

4.2  Optimal  Conditional  Partitioning 

The heart of all our 2D partitioning algorithms is an ability to optimally partition in one dimension,
given a fixed partition in the other. Suppose a row partition R is given. As described in the
previous subsection, we can aggregate work pieces forced (by R) to reside on a common processor
into super-pieces, thereby creating an N x m load matrix $\{A_{i,j}\}$. This matrix can be viewed as
N one dimensional chains; a common partitioning of their columns will produce a 2D rectilinear
partition.

The problem of finding an optimal column partition can be approached through a minor
modification to the 1D probe function. Given bottleneck constraint W, we find the largest index $c_1$
such that

$$\sum_{j=1}^{c_1} A_{i,j} \le W \quad \text{for all chains } i.$$

This is accomplished with N binary searches, one per chain, each of which finds the longest subchain
whose weight is no greater than W; $c_1$ is the length of the subchain with fewest modules. Like the
1D probe, this one greedily makes $c_1$ as large as possible without violating the load constraint in any
chain. Workload columns 1 through $c_1$ are assigned to the processors in column 1 of the processor
array. The procedure is repeated, assigning columns $c_1 + 1$ through $c_2$ to processor column 2, and
so on. It is easily proven by induction on M that this procedure will find a partition with cost
no greater than W, if one exists. The cost of calling probe is $O(NM \log m)$, provided we have
precomputed the partial sums of all N chains (an $O(nm)$ startup cost).
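
The modified probe can be sketched as below, assuming non-negative weights. The names `chain_prefix_sums` and `probe_2d` are hypothetical, and returning the greedy column cuts (padded with m when processor columns are left over) or `None` is a convenience of this example.

```python
from bisect import bisect_right
from itertools import accumulate


def chain_prefix_sums(aggregated):
    """One prefix-sum array per chain (row of the N x m aggregated matrix)."""
    return [[0] + list(accumulate(row)) for row in aggregated]


def probe_2d(prefixes, num_proc_cols, bound):
    """Greedy column partition with every block load <= bound, or None."""
    m = len(prefixes[0]) - 1
    start = 0
    cuts = [0]
    for _ in range(num_proc_cols):
        if start < m:
            # One binary search per chain; the most constrained chain
            # determines how far this processor column can extend.
            start = min(
                bisect_right(p, p[start] + bound, start) - 1 for p in prefixes
            )
        cuts.append(start)
    return cuts if start == m else None


if __name__ == "__main__":
    P = chain_prefix_sums([[1, 2, 3], [11, 13, 15]])
    print(probe_2d(P, 2, 24), probe_2d(P, 2, 23))
```

With N chains and M processor columns, each call performs at most NM binary searches over arrays of length m + 1.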

We  will  later  exploit  a  useful,  self-evident  property  of  partitions  constructed  by  this  procedure. 

Lemma 2 Let W be a feasible bottleneck constraint, and let a row partition be given. Let
$C = (c_0, c_1, c_2, \ldots, c_M)$ be the greedy column partition constructed using W, and let
$C' = (c'_0, c'_1, c'_2, \ldots, c'_M)$ be any other column partition that gives cost W. Then for all
$i = 0, 1, 2, \ldots, M$, $c_i \ge c'_i$.

The same improved searching strategy as was developed for the 1D problem can be applied here.
The argument for Lemma 1 does not depend on the partitioning of a single chain; the key insight
driving the proof is recognition of the structure of the optimal greedily constructed partition. The
same insight applies to this problem, with slightly expanded notation. For all column indices $i \le j$
and row index k, let $W_{i,j,k} = \sum_{t=i}^{j} A_{k,t}$. We define $i_0 = 0$, and for $j = 1, \ldots, M$ define $i_j$ to be the
largest index such that

$$\text{probe}(W_{i_{j-1}+1,\,i_j,\,k}) = \text{false} \quad \text{for all } k = 1, 2, \ldots, N, \qquad (1)$$

and define

$$\omega_j = \min_{1 \le k \le N} \{ W_{i_{j-1}+1,\,i_j+1,\,k} \mid \text{probe}(W_{i_{j-1}+1,\,i_j+1,\,k}) = \text{true} \}. \qquad (2)$$

These new definitions correspond to the old ones in the obvious way. Suppose the minimum
feasible bottleneck constraint is $W_{opt}$, and let F be the column processor index of the first column
where a processor achieves weight $W_{opt}$; suppose F > 1. To choose $i_1$ we examine each workload
row, and for each find the endpoint of the longest subchain whose weight is strictly less than $W_{opt}$.
We then define $i_1$ to be the smallest among these, say for row r. Since $W_{1,i_1+1,r} \ge W_{opt}$, we know
that probe($W_{1,i_1+1,r}$) = true. This shows that the set in (2) over which the minimum is taken
is non-empty, so that $\omega_1$ is well defined. The same observation holds for $i_2, i_3, \ldots, i_{F-1}$: each $\omega_j$
is well-defined. Now we know that $W_{i_{F-1}+1,\,i_F+1,\,r} = W_{opt}$ for some r, showing that $\omega_F = W_{opt}$.
Since probe($\omega_j$) = true for all j it follows that $W_{opt} = \min_{1 \le j \le M} \{\omega_j\}$. With these
definitions, Lemma 1 applies to this problem as well.

The cost of finding an optimal row partition is basically the same as 1D partitioning with a
factor of N included to account for the N binary searches in each probe call. There are also N times
as many probe calls needed to identify each $w_j$. The overall time cost of optimally partitioning
the columns is thus $O(nm + (NM \log m)^2)$. It should be noted that the one-dimensional
chains-on-chains solution in [5] is easily adapted to the optimal conditional partitioning problem. The
adaptation must also incur an $O(nm)$ startup cost, plus an additional factor of N, yielding an
$O(nm + NMm)$ algorithm. It is also possible to ensure that the algorithm finds the “greedy”
optimum, i.e., the same one the probe-based algorithm finds. Lemma 2 (using $W = W_{opt}$) thus
applies to this problem as well. As we will see, this implies that the dynamic-programming based
solution can be used in the iterative refinement algorithm to be presented in §4.4.
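
The probe that underlies conditional partitioning can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: weights are non-negative integers, the fixed row partition is given as a list of cut indices (`row_cuts`, a hypothetical name), and the smallest feasible bottleneck is located by plain integer bisection rather than by the search over the $w_j$ described above. Each probe performs one binary search per row strip, which is where the extra factor of N comes from.

```python
from bisect import bisect_right
from itertools import accumulate


def strip_prefix_sums(load, row_cuts):
    """Per row strip, prefix sums of the strip's column totals."""
    n_cols = len(load[0])
    prefixes = []
    for lo, hi in zip(row_cuts, row_cuts[1:]):
        totals = [sum(load[r][c] for r in range(lo, hi)) for c in range(n_cols)]
        prefixes.append([0] + list(accumulate(totals)))
    return prefixes


def probe(prefixes, parts, W):
    """Greedy column cuts with every block weight <= W, or None if infeasible."""
    n_cols = len(prefixes[0]) - 1
    cuts, start = [0], 0
    for _ in range(parts):
        # One binary search per strip; the tightest strip limits the column range.
        start = min(bisect_right(p, p[start] + W) - 1 for p in prefixes)
        cuts.append(start)
    return cuts if start == n_cols else None


def optimal_columns(load, row_cuts, parts):
    """Smallest feasible bottleneck and its greedy column cuts (integer weights)."""
    prefixes = strip_prefix_sums(load, row_cuts)
    lo, hi = 0, max(p[-1] for p in prefixes)  # hi is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(prefixes, parts, mid) is None:
            lo = mid + 1
        else:
            hi = mid
    return lo, probe(prefixes, parts, lo)


# Two row strips (one row each), columns split between two column processors.
load = [[1, 2, 3, 4],
        [4, 3, 2, 1]]
best, cuts = optimal_columns(load, row_cuts=[0, 1, 2], parts=2)
```

Bisection over integer bottleneck values is used here only because it is short; it relies on the fact that feasibility is monotone in W.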

4.3  Optimal  2D  Partitioning 

It  is  possible,  if  unpleasant,  to  find  the  optimal  2D  rectilinear  partitioning  using  the  procedure 
just described. There are $\binom{m-1}{N-1}$ ways of choosing a row partition (placing $N-1$ cuts among
the $m-1$ interior row boundaries); for each we can determine
the optimal column partition, and thereby determine the overall optimal partitioning. It may
be possible to reduce the complexity somewhat using a branch-and-bound technique to limit the
number of row partitions considered; nevertheless, this algorithm is exponentially complex in N.
We  do  not  yet  know  if  a  polynomial-time  algorithm  exists  for  this  problem,  or  whether  optimal 
2D  rectilinear  partitioning  is  NP-complete.  We  do  know  that  in  practice  N  will  be  too  large  for 
us  to  consider  this  approach.  In  any  event,  a  well-chosen  partition  will  likely  be  adequate,  even  if 
suboptimal.  Thus,  we  next  turn  our  attention  to  a  relatively  fast  heuristic. 

4.4  Iterative  Refinement 

We may apply the conditionally optimal partitioning algorithm in an iterative fashion. Suppose
that a row partition $R_1$ is given. For example, we might construct an initial row partition as follows:
sum the weights of all work pieces in a common row, to create a super-piece representing that row.
Find an optimal 1D partition of those super-pieces onto N processors. Use this partition as $R_1$;
assume it to be fixed, and let $C_1$ be the optimal column partition, given $R_1$. Let $\pi_1 = \pi(R_1, C_1)$
be the cost of that partitioning. Next, fix the column partition as $C_1$, and let $R_2$ be the optimal
row partitioning, given $C_1$. Let $\pi_2 = \pi(R_2, C_1)$. Clearly we may repeat this process as many times
as we like; observe that odd iterations compute column partitions and even iterations compute row
partitions.

We could choose a partition vector from either “direction”, that is, choose row indices in the
sequence $r_1, r_2, \ldots, r_{N-1}$ or in the sequence $r_{N-1}, r_{N-2}, \ldots, r_1$. We assume that the optimal
conditional partitioning algorithm approaches the problem from the same direction every iteration, for
both the row and column partitions.

A useful feature of the algorithm is that at each iteration, the cost of the solution is no worse
than the cost at the previous iteration.

Lemma 3 Given any initial row partition $R_1$, the sequence $\pi_1, \pi_2, \ldots$ is monotone non-increasing.


Proof: Without loss of generality, suppose that the partition produced at the end of iteration
$t - 1$ is a row partition $R'$; let $C'$ be the column partition treated as fixed during iteration $t - 1$.
At iteration $t$ we fix the row partition as $R'$ and seek the optimal column partition. One of the
possible column partitions is $C'$; thus we know the column partition found will have cost no greater
than $\pi(R', C')$. ■

Iterative  refinement  defines  a  fixed-point  computation,  a  fact  that  can  be  used  as  a  termination 
condition,  as  shown  in  the  following  lemma. 

Lemma 4 For every starting row partition $R_1$ there exists an iteration $I$ such that $R_j = R_I$ and
$C_j = C_I$ for all $j \ge I$.

Proof: We will need to refer to the elements of $R_j$ and $C_j$ by both position within the vector,
and by the index $j$. We thus define

$$R_j = (0,\ r_1(j),\ r_2(j),\ \ldots,\ r_{N-1}(j),\ m)$$

and

$$C_j = (0,\ c_1(j),\ c_2(j),\ \ldots,\ c_{M-1}(j),\ n).$$

By Lemma 3 we know there exists an index $k$ and a value $b$ such that $\pi_j = b$ for all $j \ge k$. Let
$j > k$, $j$ odd. Iteration $j$ computes column partition $C_{j/2+1}$, given fixed row partition $R_{j/2+1}$. As
we compute $R_{j/2+2}$ in iteration $j + 1$, a feasible partitioning is $R_{j/2+2} = R_{j/2+1}$. However, $R_{j/2+2}$
is “greedy” with respect to $C_{j/2+1}$ and $b$, while $R_{j/2+1}$ need not be. Thus, by Lemma 2 we must
have $r_i(j/2+2) \ge r_i(j/2+1)$ for all $i = 1, 2, \ldots, N - 1$. The same argument can be applied to
show that $c_i(j/2+2) \ge c_i(j/2+1)$ for all $i = 1, 2, \ldots, M - 1$. Since these indices cannot grow
without bound, eventually the row and column partitions must stop changing. ■

Lemma 4 shows that a safe termination procedure is to iterate until the row and column
partitions stop changing. It is natural to ask how many iterations are required to achieve convergence.
We can bound this number, although only loosely.
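
The alternation and its fixed-point termination test can be sketched as follows. This Python is a hedged illustration rather than the authors' code: it assumes non-negative integer weights, obtains each conditional optimum by integer bisection over a left-to-right greedy probe (a substitute for the search described in §4.2), and reuses the same routine for rows by transposing the load matrix. The names `conditional_optimum`, `cost`, and `iterative_refinement` are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def conditional_optimum(load, fixed_cuts, parts):
    """Best bottleneck and greedy cuts over columns, with row strips held fixed."""
    n_cols = len(load[0])
    prefixes = []
    for lo, hi in zip(fixed_cuts, fixed_cuts[1:]):
        totals = [sum(load[r][c] for r in range(lo, hi)) for c in range(n_cols)]
        prefixes.append([0] + list(accumulate(totals)))

    def probe(W):
        cuts, start = [0], 0
        for _ in range(parts):
            start = min(bisect_right(p, p[start] + W) - 1 for p in prefixes)
            cuts.append(start)
        return cuts if start == n_cols else None

    lo, hi = 0, max(p[-1] for p in prefixes)
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(mid) is None:
            lo = mid + 1
        else:
            hi = mid
    return lo, probe(lo)


def cost(load, row_cuts, col_cuts):
    """Weight of the heaviest block of the rectilinear partition."""
    return max(
        sum(load[r][c] for r in range(r0, r1) for c in range(c0, c1))
        for r0, r1 in zip(row_cuts, row_cuts[1:])
        for c0, c1 in zip(col_cuts, col_cuts[1:])
    )


def iterative_refinement(load, row_cuts, N, M, max_rounds=1000):
    """Alternate column and row optimizations until neither partition changes."""
    load_t = [list(col) for col in zip(*load)]  # transpose: rows become columns
    col_cuts, history = None, []
    for _ in range(max_rounds):  # guard; Lemma 4 gives termination for the greedy variant
        _, new_cols = conditional_optimum(load, row_cuts, M)
        history.append(cost(load, row_cuts, new_cols))
        _, new_rows = conditional_optimum(load_t, new_cols, N)
        history.append(cost(load, new_rows, new_cols))
        if new_rows == row_cuts and new_cols == col_cuts:
            break
        row_cuts, col_cuts = new_rows, new_cols
    return row_cuts, col_cuts, history


load = [[1, 2, 3, 4],
        [4, 3, 2, 1],
        [1, 1, 1, 1],
        [5, 0, 0, 5]]
rows, cols, history = iterative_refinement(load, [0, 1, 4], N=2, M=2)
```

The `history` list records the cost after every half-step, so the monotonicity stated in Lemma 3 can be checked directly on any input.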

Lemma 5 Let $U$ be the number of unique bottleneck constraints. The iterative refinement algorithm
converges in $O(U \cdot (n + m))$ iterations.

Proof: The proof of Lemma 4 implies that when convergence has not yet been achieved, no more
than $n + m$ successive iterations may occur without the partition cost decreasing. The present
lemma’s conclusion follows from the observation that there are no more than $U$ possible values for
the partition cost. In the worst case $U = O((nm)^2)$, since every bottleneck constraint is defined by
two row indices and two column indices. ■

Figure 2: Example showing that iterative refinement may converge to a suboptimal solution

Despite the $O((nm)^2 (n + m))$ bound, our experience has been that convergence is achieved in
far fewer iterations, perhaps in $O(\max\{N, M\})$ iterations. One possible explanation is that the
solution space for the problems we study has many local minima; another is that there are strong
as-yet-undiscovered theoretical reasons for the fast convergence.

The  solution  found  by  iterative  refinement  is  locally  optimal,  in  the  sense  that  we  are  unable 
to  reduce  the  partition  cost  by  moving  any  set  of  row  indices,  or  any  set  of  column  indices.  It 
may,  however,  be  possible  to  improve  the  solution  by  simultaneously  moving  a  row  index  and  a 
column  index.  This  is  illustrated  by  the  example  in  Figure  2.  It  is  possible  for  iterative  refinement 
to  converge  to  the  partition  shown  with  bottleneck  weight  3;  this  cost  is  reduced  by  appropriately 
moving  both  the  row  and  column  partitions.  The  practical  severity  of  this  phenomenon  is  unclear. 
Should it prove to be a problem, the algorithm might be adapted to perturb the row and column
partitions simultaneously after convergence, to determine whether any improvement in the solution
quality can be achieved.

The ultimate converged cost of a partition constructed via iterative refinement depends on the
starting partition $R_1$. We have tried a number of different seemingly natural methods for computing
$R_1$. Somewhat to our surprise, we found that the best method (marginally) is to generate several
initial partitions randomly, and keep the best resulting partition. This certainly makes sense if the
partition solution space has but a few local minima. Random generation increases the likelihood
of hitting an initial partition that leads to the optimal solution.

In the subsection to follow we discuss an application of iterative refinement to irregular mesh
problems. We find that iterative refinement can effectively reduce communication costs and
sometimes achieve better performance than other partitioning methods.

4.5 Application of Iterative Refinement

We have applied iterative refinement to irregular two-dimensional meshes typical of those used to
solve two-dimensional fluid flow problems. One class of mesh is “unstructured”;
Figure 3(a) illustrates an unstructured grid (called Grid A henceforth) surrounding the
cross-section of an airfoil [15], and (b) shows a closeup of a dense region of A. Figure 3(c) illustrates
part of another unstructured grid [24], called Grid B, but one that is far less irregular. Finally,
Figure 3(d) illustrates a grid C that is highly regular, except for an irregularly placed region of
extremely high density. All edges in the latter grid have either vertical or horizontal orientation.
As we will see, the latter type of grid gives rectilinear partitioning its greatest advantage over other
techniques.

The grids we study have tens of thousands of grid points; A has 11143 points and 32818 edges,
B has 19155 points and 56895 edges, C has 45625 points and 90700 edges. We chose to partition
with the highest possible refinement; however, the number of grid points precludes the actual
construction of a load matrix where every element represents at most one point. Instead, prior to
an iteration, we construct a load matrix with either N or M rows (depending on whether we are
performing a column or a row iteration), and T columns, T being the number of points. This is
accomplished in time proportional to the size of this matrix. While the cost of an iteration may
become dominated asymptotically by this setup cost, in our experience it makes little sense to
create and store an immense, sparse matrix. On the grids described here, the complete rectilinear
partitioning algorithm ran in under one minute on a Sparc 1+ workstation. The other methods
were not much faster, as the I/O time for loading the grid tended to dominate them all. One
exception to this occurred on partitioning the largest grid for the largest processor array. The
partitioning algorithm no longer ran in memory, and suffered from a great deal of paging traffic as
a consequence.

We report on experiments conducted on the three aforementioned grids, using three different
partitioning methods: iterative refinement, binary dissection [1], and “jagged” rectilinear partitioning
[20]. Binary dissection is a commonly used technique which very carefully balances workload;
however, its partitions are constructed without regard for communication patterns. Jagged rectilinear
partitioning has recently been proposed to overcome some of binary dissection’s problems. The
domain is first divided into N strips of approximately equal weight. Following this, each strip is
individually divided into M rectangles of approximately equal weight. While partition cuts do span
the entire domain in one dimension, they are “jagged” in the other.

We also experimented with so-called “strip” partitions, defined by the optimal 1D solution of the
projection of these 2D problems onto the line. We do not report the results of these experiments,
as strip partitions were uniformly worse than the ones we study, due primarily to excessive
interprocessor communication (even if primarily local), caused by a poor area to perimeter ratio.

Our experiments assessed the overall cost of a partition to a processor to be the sum of the
weights of the grid points it is assigned, plus a communication cost that depends on both the
architecture and the mapping. Computations on grids of this type are based primarily on edges;
hence the cost of a grid point is taken to be the total number of its edges. The communication
cost is defined by edges that span different processors. Each edge is classified as being internal,
local, or global, depending on whether the edge is completely contained in one processor, spans
processors which are adjacent in the processor mesh, or spans processors which are more distantly
separated. In the experiments we present, “adjacent” means adjacent in a North-East-West-South
mesh. We comment later on results obtained assuming a mesh that includes direct connections
between diagonally adjacent processors as well.

Figure 3: Grids used in application study. (a) is a highly unstructured mesh around an airfoil
cross-section; (b) is a closeup; (c) is a more regular unstructured mesh; (d) is an artificial mesh
with perfectly orthogonal edges and an offset region of high density.

Each processor’s local communication cost is taken to be the number of its local edges; the global
communication cost is the number of global edges times a parameter G. An edge’s communication
cost is charged to both of its processors. The cost of a partition is the maximum cost of any processor
in that partition. We may estimate speedup as the sum of the weights of all grid points divided by
the maximum processor cost. We have experimentally compared a number of these estimates with
actual speedup measurements on an Intel iPSC/860, and found them to be reasonably accurate.
Of course, the iPSC/860 does not have the same type of local/global communication differential as
that assumed here; the cost of a message is largely insensitive to the distance it must travel (at least
in the absence of serious network congestion). Nevertheless it seems intuitive that scaling global
communication by a parameter G is an appropriate model for the architectures of interest.
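
The cost model above is simple enough to state as code. The following Python is a minimal sketch under assumed data structures (none of them from the original study): `edges` is a list of point pairs, and `owner` maps each grid point to the (row, column) coordinates of its processor in the mesh. Adjacency is the North-East-West-South kind, i.e. Manhattan distance 1.

```python
from collections import defaultdict


def classify(p, q):
    """Classify an edge by the mesh positions p, q of its two processors."""
    if p == q:
        return "internal"
    if abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1:
        return "local"
    return "global"


def estimated_speedup(edges, owner, G):
    """Total point weight divided by the cost of the most expensive processor."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    cost = defaultdict(int)
    for point, deg in degree.items():
        cost[owner[point]] += deg  # a point's weight is its number of edges
    for u, v in edges:
        charge = {"internal": 0, "local": 1, "global": G}[classify(owner[u], owner[v])]
        cost[owner[u]] += charge  # communication is charged to both endpoints
        cost[owner[v]] += charge

    return sum(degree.values()) / max(cost.values())


# A four-point path a-b-c-d spread over three processors of a mesh.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
owner = {"a": (0, 0), "b": (0, 0), "c": (0, 1), "d": (1, 2)}
cheap_global = estimated_speedup(edges, owner, G=1)
costly_global = estimated_speedup(edges, owner, G=3)
```

In this toy case the single global edge (c, d) makes the processor holding c the bottleneck once G grows, which is the effect the parameter is meant to expose.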

We use three metrics to characterize a partition. One is $f_I$, the fraction of edges that are
internal; another is $f_E$, the fraction of external edges that are local. Finally, $f_U$ is the average
processor workload divided by the load of the maximally loaded processor, under the assumption
that all communication has cost 0. $f_I$ and $f_E$ are measures of how well the partition preserves
locality of communication, while $f_U$ is a measure of how well the partition balances workload.
Table 1 presents these quantities, measured on our problem set, mapping to 16 x 16, 32 x 32, and
64 x 64 processor arrays. For both $f_I$ and $f_E$ we see that rectilinear partitioning is somewhat better
at keeping edges internal, and that it excels at keeping external edges local. The price it pays for
this locality is increased load imbalance, as is evident from the $f_U$ values. Of course, this is to be
expected, since a rectilinear partition is a constrained version of a binary dissection.
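
The three metrics can be computed directly from an edge list and a point-to-processor map. The sketch below is illustrative only; the input layout (`edges`, `owner`, `num_processors`) is an assumption, and returning 1.0 for the local fraction when there are no external edges is a convention chosen here, not something stated in the text.

```python
def classify(p, q):
    """internal, local (North-East-West-South neighbours), or global."""
    if p == q:
        return "internal"
    if abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1:
        return "local"
    return "global"


def partition_metrics(edges, owner, num_processors):
    """Return (f_I, f_E, f_U) for a mapping of grid points onto a processor mesh."""
    counts = {"internal": 0, "local": 0, "global": 0}
    work = {}
    for u, v in edges:
        counts[classify(owner[u], owner[v])] += 1
        for point in (u, v):
            # Each edge adds one unit of work to the owner of each endpoint.
            work[owner[point]] = work.get(owner[point], 0) + 1

    external = counts["local"] + counts["global"]
    f_I = counts["internal"] / len(edges)
    f_E = counts["local"] / external if external else 1.0  # assumed convention
    f_U = (sum(work.values()) / num_processors) / max(work.values())
    return f_I, f_E, f_U


# Four-point path on a 2 x 3 processor mesh (three of six processors are idle).
edges = [("a", "b"), ("b", "c"), ("c", "d")]
owner = {"a": (0, 0), "b": (0, 0), "c": (0, 1), "d": (1, 2)}
f_I, f_E, f_U = partition_metrics(edges, owner, num_processors=6)
```

Note that the average in the utilization figure is taken over all processors of the array, including idle ones, which is why it drops as the array grows relative to the problem.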

Figures 4, 5, and 6 give estimated processor efficiencies on the three grids, measured as the
estimated speedup divided by the number of processors. Each performance curve is parameterized
by G, in order to show how performance is affected by an increasing cost differential between local
and global communication. Each graph plots performance curves for each of the three partitioning
methods (encoded here as BD, JP, and RP) with 16 x 16, 32 x 32, and 64 x 64 processor arrays. All
initial RP row partitions were selected by computing the optimal 1D partition for N processors.

For grid A we see that BD has a clear advantage over the other methods when global
communication is as cheap as local. However, as G grows it increasingly suffers from its global edges. On
the 16 x 16 and 32 x 32 arrays JP surpasses it once G > 3; however, it fails to surpass BD at all on
the 64 x 64 array. On the 16 x 16 array, RP surpasses BD once G > 5, and surpasses JP once G > 9.
On the 32 x 32 array RP surpasses both BD and JP once G > 5, whereas on the 64 x 64 array it is
bested by both BD and JP. At this extreme point most edges go off processor, and the workload
is small. BD’s advantage in load balancing then dominates. Observe however that performance at
the right end of the curve is not good; this may be indicative of placing too small a problem on the
machine.

(f_I, f_E, f_U) values for Grid A

Processor array             16 x 16               32 x 32               64 x 64
Binary Dissection           (0.73, 0.32, 0.98)    (0.49, 0.29, 0.92)    (0.19, 0.24, 0.72)
Jagged Partitioning         (0.70, 0.79, 0.84)    (0.45, 0.73, 0.68)    (0.19, 0.52, 0.53)
Rectilinear Partitioning    (0.77, 0.91, 0.27)    (0.61, 0.82, 0.27)    (0.37, 0.68, 0.24)

(f_I, f_E, f_U) values for Grid B

Processor array             16 x 16               32 x 32               64 x 64
Binary Dissection           (0.84, 0.37, 0.98)    (0.67, 0.37, 0.96)    (0.37, 0.38, 0.86)
Jagged Partitioning         (0.84, 0.95, 0.98)    (0.67, 0.87, 0.95)    (0.39, 0.77, 0.86)
Rectilinear Partitioning    (0.84, 0.97, 0.92)    (0.68, 0.94, 0.80)    (0.44, 0.86, 0.66)

(f_I, f_E, f_U) values for Grid C

Processor array             16 x 16               32 x 32               64 x 64
Binary Dissection           (0.91, 0.27, 0.99)    (0.82, 0.29, 0.98)    (0.62, 0.30, 0.92)
Jagged Partitioning         (0.91, 0.92, 0.98)    (0.82, 0.76, 0.98)    (0.64, 0.66, 0.92)
Rectilinear Partitioning    (0.91, 1.00, 0.85)    (0.83, 1.00, 0.85)    (0.70, 1.00, 0.69)

Table 1: Fraction f_I of internal edges, fraction f_E of external edges which are local, and processor
utilization f_U under no communication costs, for different meshes, processor arrays, and partitioning
methods

Grid  B  is  much  more  regular  than  A,  a  fact  that  translates  into  higher  performance  under 
higher  values  of  G.  On  the  two  smaller  arrays  the  RP  curves  cross  the  JP  and  BD  curves  in  the 
region  of  G  =  5.  On  the  largest  array  JP  is  somewhat  better  than  the  other  methods,  while  the 
RP  and  BD  curves  are  surprisingly  similar  after  G  >  3. 

Grid C was constructed specifically to highlight RP’s advantages over the other methods. Under
RP, none of its edges are global, so performance is insensitive to G. RP’s cross-over points are
again in the region $G \in (3, 5)$; owing to its complete avoidance of global costs, its performance is
substantially better than the others under high values of G.

Figure 4: Processor utilizations on the BD, JP, and RP partitions of grid A, for 16 x 16, 32 x 32,
and 64 x 64 processor arrays

Figure 5: Processor utilizations on the BD, JP, and RP partitions of grid B, for 16 x 16, 32 x 32,
and 64 x 64 processor arrays

Figure 6: Processor utilizations on the BD, JP, and RP partitions of grid C, for 16 x 16, 32 x 32,
and 64 x 64 processor arrays

Processor array    16 x 16    32 x 32    64 x 64
Grid A                13         11         37
Grid B                 9         11         19
Grid C                 5          5          5

Table 2: Iterations used by iterative refinement to converge


On these problems, rectilinear partitioning required far fewer iterations to reach convergence
than would be suggested by Lemma 5. Table 2 gives the number of iterations required for each of
the nine rectilinear partitions generated.

We also evaluated the cost of these partitions assuming that diagonally adjacent processors are
connected in the local network. In every case the performance of BD was completely unaffected.
RP’s performance improved slightly, usually by no more than 10%. JP’s performance improved
sharply, to the extent that it outperforms RP on almost all the Grid A and Grid B partitions. RP
retains its superiority on Grid C. These results suggest that jagged partitions effectively capture
locality when that locality is defined to include diagonally connected processors. Of course, there
is no guarantee that a jagged partition will map perfectly onto an 8-neighbor mesh; an interesting
future line of inquiry is to develop algorithms that guarantee such locality. Rectilinear partitions
are most desirable when the rectilinear constraint matches the rectilinear nature of
North-East-West-South meshes.

The data presented here indicate that rectilinear partitions have their utility. When global
communication costs are high, it is worthwhile to accept some load imbalance for the sake of
communication locality. On the other hand, it is clear that rectilinear partitions are not desirable
when the problem is highly irregular and global communication is comparatively cheap. We plan
further experimentation with these partitioning strategies on actual codes on actual machines.

5  Three  Dimensional  Partitioning 

We have already seen that RPP in one dimension can be solved in polynomial time; it is not yet
known whether the two-dimensional problem is tractable. In this section we demonstrate that RPP
in three dimensions is NP-complete. We establish the fact by demonstrating that an arbitrary
monotone 3SAT problem [8] can be solved by any three-dimensional RPP algorithm. Since the
monotone 3SAT problem is NP-complete, so is RPP in three dimensions.

The general 3SAT problem has the following form. We are given $n$ Boolean literals $x_1, \ldots, x_n$,
and $m$ clauses $C_1, \ldots, C_m$. Each clause is the disjunction of three distinct literals, each of which
may be complemented or uncomplemented. For example, $(x_1 + x_3 + x_{11})$ and $(x_1 + x_2 + x_{14})$ are
two clauses. The 3SAT problem is to find a Boolean assignment for each literal such that every
clause evaluates to true. The monotone 3SAT problem requires that every given clause have either
all complemented literals, or all uncomplemented literals. A useful consequence of the monotone
restriction is that for any given triple of literals $(x_i, x_j, x_k)$ there are at most two clauses involving
all three simultaneously: one where they are all complemented, and one where they are not. It
has been shown that the monotone 3SAT problem is NP-complete [8]. Minor modifications to
the approach we develop will work for general 3SAT problems; it is simply easier to describe the
transformation if we assume the clauses are monotone.

A choice of partitioning can be interpreted as an assignment of literal values and assessment of a
clause’s truth value. We first introduce these ideas by application to the monotone 2SAT problem,
where clauses have two literals (2SAT can be solved in polynomial time). Let $x_1$ and $x_2$ be two
literals; only two monotone clauses are possible, $(x_1 + x_2)$ or $(\bar{x}_1 + \bar{x}_2)$. In either case, only one
assignment of values to the literals can cause the clause not to be satisfied, $x_1 = x_2 = 0$ in the
former case and $x_1 = x_2 = 1$ in the latter. We capture this in a partitioning framework with a 3 x 3
domain with binary workload weights, to be partitioned into four pieces. The center weight is 1;
one corner is also weighted with 1 depending on the clause, and all other weights are 0. Figure 7(a)
illustrates the domain, and the assignment of infeasibility products $x_1 x_2$, $x_1 \bar{x}_2$, $\bar{x}_1 x_2$, and $\bar{x}_1 \bar{x}_2$ to
opposing corners. The choice of a row partition corresponds to an assignment to $x_1$, the choice of
a column partition corresponds to an assignment to $x_2$. Our weighting rule is to assign a 1 to a
corner whose infeasibility product is true when the corresponding truth assignment fails to satisfy
the clause. Thus, if $(x_1 + x_2)$ is a problem clause, then the $\bar{x}_1 \bar{x}_2$ corner is given a 1; if $(\bar{x}_1 + \bar{x}_2)$ is a
clause then the $x_1 x_2$ corner is given a 1. If both clauses appear in the problem, both corresponding
corners are weighted by 1. This is equivalent to requiring that $x_1 \oplus x_2 = 1$ ($\oplus$ being the exclusive
OR operator). Also, in our problem transformation it will be possible for $x_1$ and $x_2$ to represent
the same literal. If this is the case, we place 1s in the $x_1 \bar{x}_2$ and $\bar{x}_1 x_2$ corners, in order to force
a common selection for the literal, in both its column and row representations. Figure 7(b) and
(c) illustrate the weighting corresponding to conditions $(x_1 + x_2)$ and $(x_1 \oplus x_2)$ respectively, and
show the partition corresponding to the assignment $x_1 = 0$, $x_2 = 1$. Observe that the bottleneck
weight is 1, whereas it would be 2 if the infeasible assignment $x_1 = 0$ and $x_2 = 0$ were chosen.

The infeasible assignment is the only one achieving a bottleneck cost of 2. This is true of the
construction for any clause, and is the key to determining whether the assignment corresponding
to some partitioning satisfies all clauses.
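
The behaviour of the 3 x 3 gadget is easy to check exhaustively. The Python sketch below makes one assumption that the text does not fix: a concrete corner layout in which row 0 and column 0 carry the uncomplemented literals, and setting a variable to 1 places the cut so that the center is grouped with index 0. Any layout consistent with Figure 7 would behave the same way; the names `gadget` and `bottleneck` are hypothetical.

```python
from itertools import product


def gadget(forbidden):
    """3x3 weights: center 1, plus a 1 in the corner of each forbidden (x1, x2)."""
    w = [[0] * 3 for _ in range(3)]
    w[1][1] = 1
    for x1, x2 in forbidden:
        w[0 if x1 else 2][0 if x2 else 2] = 1
    return w


def bottleneck(w, x1, x2):
    """Heaviest of the four blocks induced by the cuts that encode (x1, x2)."""
    row_cut = 2 if x1 else 1  # x1 = 1 groups the center row with row 0
    col_cut = 2 if x2 else 1  # x2 = 1 groups the center column with column 0
    blocks = product([(0, row_cut), (row_cut, 3)], [(0, col_cut), (col_cut, 3)])
    return max(
        sum(w[r][c] for r in range(r0, r1) for c in range(c0, c1))
        for (r0, r1), (c0, c1) in blocks
    )


assignments = list(product((0, 1), repeat=2))

# Clause (x1 + x2) forbids only x1 = x2 = 0.
or_costs = {a: bottleneck(gadget([(0, 0)]), *a) for a in assignments}

# Both monotone clauses together forbid x1 = x2, i.e. they require x1 XOR x2.
xor_costs = {a: bottleneck(gadget([(0, 0), (1, 1)]), *a) for a in assignments}
```

In both cases the bottleneck is 2 exactly for the forbidden assignments and 1 otherwise, because each of the four blocks contains exactly one corner and only one block contains the center.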

A monotone 2SAT problem can be transformed into a rectilinear partitioning problem using the
ideas expressed above. Given $n$ literals $x_1, \ldots, x_n$ we will create a $(4n - 1) \times (4n - 1)$ binary domain.
For each variable we assign three contiguous rows, and three contiguous columns. Variables’ sets
of rows and columns are separated by a single “padding” row and single “padding” column whose
purpose will be to force a partition within each variable’s set of rows, and within each variable’s
set of columns. We assign 1s and 0s as described above for the 3 x 3 intersection of $x_i$’s rows and
$x_j$’s columns. Elements at the intersection of two variables that never appear in the same clause
are all assigned value 0. We place a 0 wherever a “middle” row for a variable meets a padding
column; likewise, we place a 0 wherever a variable’s middle column meets a padding row. Every
other entry of a padding row or a padding column is 1. The construction for the problem
$(x_1 + x_2)(\bar{x}_2 + \bar{x}_3)$ is shown in Figure 8. We seek an optimal rectilinear partitioning of this domain
onto a $(3n - 1) \times (3n - 1)$ array of processors. Weights in padding rows and columns are defined
in such a way that for a bottleneck weight of 1 to be achieved it is necessary that a partition never
group padding and non-padding rows or group padding and non-padding columns. This forces a
partition of every variable’s rows, and every variable’s columns.

Figure 7: Transformation of 2SAT Problem into Rectilinear Partitioning Problem

If the domain can be partitioned and achieve a bottleneck cost of 1, then the 2SAT problem is
solved by the assignment implicit in the optimal partitioning. Otherwise the 2SAT problem cannot
be solved. Figure 8 also illustrates the partition corresponding to the solution $x_1 = 1$, $x_2 = 0$, and
$x_3 = 0$.

The extension of these constructs to three dimensions is straightforward. Let $x_1$, $x_2$, $x_3$ be
literals. In a monotone 3SAT problem the only possible clauses are $(x_1 + x_2 + x_3)$ and $(\bar{x}_1 + \bar{x}_2 + \bar{x}_3)$;
in the former case only the assignment $x_1 = x_2 = x_3 = 0$ fails to satisfy the clause, in the latter
case only $x_1 = x_2 = x_3 = 1$ fails to satisfy the clause. In the event that both clauses appear, their
conjunction is not satisfied if and only if the variables are all assigned the same value. Now let us
associate a 3 x 3 x 3 clause region with these literals. The 2SAT construction associated $x_1$ with the
Y dimension and $x_2$ with the X dimension; we augment this and associate $x_3$ with the Z dimension.
It is convenient to view a clause region as three stacked 3 x 3 arrays with XY orientation. The
centermost element of the middle array will have value 1, all other elements of the middle array
are 0. Like the 2SAT problem, the four corners of the lowest 3 x 3 array represent products of all
three literals. In the 3SAT case, all products in the “bottom” array include $x_3$ and all products in
the “top” array include $\bar{x}_3$. The $x_1$ and $x_2$ combinations are identical to the 2SAT problem. For
example, the four corners of the top array (northwest, northeast, southwest, and southeast) carry
the infeasibility products $x_1 x_2 \bar{x}_3$, $x_1 \bar{x}_2 \bar{x}_3$, $\bar{x}_1 x_2 \bar{x}_3$, and $\bar{x}_1 \bar{x}_2 \bar{x}_3$, arranged as in the 2SAT layout of
Figure 7. Like the 2SAT problem, we
weight a corner with 1 if the truth assignment satisfying the corresponding infeasibility product
fails to satisfy the clause. Thus, if $(x_1 + x_2 + x_3)$ appears as a clause, we place a 1 in the $\bar{x}_1 \bar{x}_2 \bar{x}_3$
corner. If $(\bar{x}_1 + \bar{x}_2 + \bar{x}_3)$ is a clause then we place a 1 in the $x_1 x_2 x_3$ corner. Both 1s are placed if
both of these clauses appear in the problem. All other entries of the clause region are 0. All clause
regions corresponding to three distinct literals that do not appear in a clause are zeroed out. Clause
regions involving intersections of a literal and itself are weighted to ensure that a bottleneck value
of 1 is achieved only if partitions are chosen corresponding to the same selection of literal value in
each dimension. For example, if $x_1$ and $x_3$ happen to be the same literal, then a 1 is placed in any
corner whose product involves $x_1 \bar{x}_3$ or $\bar{x}_1 x_3$.

Figure 8: Example of 2SAT problem $(x_1 + x_2)(\bar{x}_2 + \bar{x}_3)$ mapped to 2D rectilinear partitioning of
9 x 9 binary domain onto 6 x 6 array of processors. Partition of solution $x_1 = 1$, $x_2 = 0$, $x_3 = 0$ is
shown.

Assignment of a value for $x_1$ corresponds to selection of a plane with YZ orientation. The
plane's intersection with each layer in the clause region looks the same: it is either the $x_1 = 0$
line or the $x_1 = 1$ line as seen in the 2SAT problem (Figure 7). Similarly, assignment of a value
for $x_2$ corresponds to a plane whose intersection with each layer is identical, either the line for
$x_2 = 0$ or the line for $x_2 = 1$. Finally, selection of $x_3 = 1$ is accomplished by selecting an XY plane
that separates the bottom two layers from the top layer, while selection of $x_3 = 0$ separates the
bottom layer from the top two. Under this construction, selection of planes corresponding to an
assignment that makes an infeasibility product true will place the centermost 1 in the same volume
as the “infeasibility 1”, giving rise to a bottleneck weight of 2. This fact is important enough to
state formally.

Lemma 0 Let $I(x_1, x_2, x_3)$ be any infeasibility product whose position in a clause region is set to
value 1. Then any partition whose associated assignment sets $I(x_1, x_2, x_3) = 1$ places the
infeasibility 1 and the clause region's center 1 in the same partition volume. The bottleneck cost of any
such partition is at least 2.
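
The lemma can be checked by brute force on a single clause region. The following is a minimal sketch, not part of the original construction: the indexing convention (index 0 along an axis is the side grouped with the center when that literal is 1) follows the text for $x_3$ and is assumed for $x_1$ and $x_2$, and the function names are hypothetical.

```python
from itertools import product

def clause_region(positive_clause=False, negative_clause=False):
    """Build a 3x3x3 clause region w[i1][i2][i3], one axis per literal.

    Assumed convention: along each axis, index 0 is the side that ends up
    grouped with the center when the literal is 1, index 2 when it is 0.
    """
    w = [[[0] * 3 for _ in range(3)] for _ in range(3)]
    w[1][1][1] = 1                  # centermost element
    if positive_clause:             # (x1 + x2 + x3) fails only when all are 0
        w[2][2][2] = 1
    if negative_clause:             # the all-negated clause fails only when all are 1
        w[0][0][0] = 1
    return w

def groups(value):
    """Index groups along one axis induced by the plane chosen for a literal value."""
    return ([0, 1], [2]) if value == 1 else ([0], [1, 2])

def bottleneck(w, assignment):
    """Largest total weight of any of the 8 volumes induced by the assignment."""
    g1, g2, g3 = (groups(v) for v in assignment)
    return max(
        sum(w[i][j][k] for i in a for j in b for k in c)
        for a in g1 for b in g2 for c in g3
    )

def costs(w):
    """Bottleneck cost of every assignment of (x1, x2, x3)."""
    return {a: bottleneck(w, a) for a in product((0, 1), repeat=3)}
```

For a region built with `positive_clause=True`, only the assignment `(0, 0, 0)` yields cost 2, which is the behavior the lemma describes.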

As in the 2SAT mapping, we add “padding” layers to ensure that any partition with cost 1 must
choose one of two planes in each dimension of each clause region. The assignment of 1s and 0s to
padding layers is similar to the 2SAT case. Figure 9 defines the assignment in terms of how each
layer's elements are weighted in the immediate vicinity of a clause region. Figure 9(a) shows how
a portion of a padding layer (with XY orientation) is weighted when centered directly above or
below a clause region (the heavy lines illustrate how the clause region is positioned). The only way
to separate the three 1s in each corner is to choose the four partitioning layers with XZ orientation
and four with YZ orientation that do not intersect the $3 \times 3$ core. These layers ensure that no
elements on the XZ and YZ faces of the clause region will be grouped with any elements from
any other clause region, at least if a bottleneck cost of 1 is to be achieved. Figures 9(b) and (c)
then show how to weight elements in padding layers with XZ and YZ orientation, depending on
whether the padding layer intersects a boundary layer or the middle layer of the clause
region. Weights for the clause region (which is outlined) are not included. The corner 1s seen in
Figure 9(b) are adjacent to the corner 1s seen in Figure 9(a); in order to achieve a bottleneck cost
of 1 it is necessary to place two partitioning planes with XY orientation to contain the XY padding layer.
This ensures that any element at an XY face of the clause region will not be grouped with elements
from any other clause region.

To transform a monotone 3SAT problem we construct a $(4n - 1) \times (4n - 1) \times (4n - 1)$ domain.
The first three coordinate positions in each dimension correspond to $x_1$; the fourth coordinate
position in each dimension corresponds to padding; the next three coordinate positions in each
dimension correspond to $x_2$, and so on. The domain is weighted as described above. We have seen
that in order to achieve a bottleneck cost of 1 it is necessary to contain each padding layer with
two partitioning planes. Since there are $n - 1$ padding layers along each dimension, this defines
$2(n - 1)$ planes orthogonal to each dimension. Furthermore,
it is also necessary to appropriately partition each clause region in each dimension. This leads to
an additional $n$ partitioning planes orthogonal to each dimension, for a total of $3n - 2$ planes that
divide each dimension into $3n - 1$ intervals. Consequently, the dimensions of
the target architecture are $(3n - 1) \times (3n - 1) \times (3n - 1)$.
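
The size bookkeeping can be sketched in a few lines. This is an illustrative sketch only; the helper names `layout` and `intervals_per_dimension` are hypothetical, and the count assumes, as argued above, that a cost-1 partition isolates every padding position and splits each literal's three positions into two intervals.

```python
def layout(n):
    """Label the coordinate positions along one dimension for n literals."""
    labels = []
    for i in range(1, n + 1):
        labels += [("literal", i)] * 3       # three positions for x_i
        if i < n:
            labels.append(("padding", i))    # one padding position between regions
    return labels

def intervals_per_dimension(n):
    """Number of intervals (processors) along one dimension in a cost-1 partition."""
    padding = sum(1 for kind, _ in layout(n) if kind == "padding")
    # Each literal's three positions are cut once, giving 2 intervals per literal;
    # each padding position is contained by two planes and is an interval of its own.
    return 2 * n + padding
```

For $n = 3$ this gives a domain of 11 positions per dimension and 8 intervals per dimension, matching $4n - 1$ and $3n - 1$.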
Figure 9: Weighting of elements in padding planes


Finally, we must show that under this construction, the 3SAT problem has a solution if and
only if the corresponding three-dimensional partitioning problem achieves a cost of 1. This is an easy
consequence of the fact that each volume in a partitioned clause region has weight no greater than
1 if and only if no clause region is partitioned to satisfy one of its infeasibility conditions (either
clause infeasibility or conflicting assignment of the same literal). Since monotone 3SAT is NP-complete
and the construction is polynomial in $n$, three-dimensional RPP is NP-hard. Since one can always
check in polynomial time whether a
proposed RPP solution achieves bottleneck cost 1, three-dimensional RPP is in NP, and is therefore
NP-complete. In fact,
since the RPP matrix we construct is binary, we have a stronger result.

Theorem 7 Binary RPP in three dimensions is NP-complete.
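
The polynomial-time check used in the argument above can be sketched as follows. This is a minimal illustration under assumed conventions, not code from the paper: weights are given as a nested list `w[x][y][z]`, and each cut list is sorted, with a cut at position `c` separating index `c - 1` from index `c`.

```python
def bottleneck_cost(w, cuts_x, cuts_y, cuts_z):
    """Return the maximum block load of a 3D rectilinear partition.

    w       -- nested list of weights, indexed w[x][y][z]
    cuts_*  -- sorted cut positions along each dimension (assumed convention)
    """
    def intervals(cuts, size):
        bounds = [0] + list(cuts) + [size]
        return [range(a, b) for a, b in zip(bounds, bounds[1:])]

    nx, ny, nz = len(w), len(w[0]), len(w[0][0])
    worst = 0
    for ix in intervals(cuts_x, nx):
        for iy in intervals(cuts_y, ny):
            for iz in intervals(cuts_z, nz):
                load = sum(w[x][y][z] for x in ix for y in iy for z in iz)
                worst = max(worst, load)
    return worst
```

Every domain element is visited exactly once, so the check is linear in the size of the domain; comparing the returned value with 1 decides whether a proposed partition is a valid certificate.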
6  Summary


This paper examines the problem of partitioning with one-, two-, or three-dimensional rectilinear
partitions. When used to balance workload in data-parallel computations having localized
communication, such partitions can be expected to reduce the need for expensive global communication.

For the one-dimensional case we improved upon the best published solution to date when
$m > M$, reducing the cost of finding the optimal partition of $m$ modules among $M$ processors to
$O(m + (M \log m)^2)$. For the two-dimensional case we showed how it is possible to find the best
possible partitioning in a given dimension, provided that the partition in the alternate dimension
remains fixed. This result can be used to find the optimal partition in two dimensions, but with
exponentially large cost (if the number of processors in each dimension is a problem parameter).
The result also serves as the basis for a heuristic that iteratively improves upon a solution. The
heuristic is shown to converge to a fixed point in a bounded number of iterations. Empirical studies
show that the heuristic may provide some performance advantage when the differential between
the local and global network bandwidth is moderately large. Finally, we showed that the problem
of finding an optimal three-dimensional rectilinear partition is NP-complete.
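
To make the one-dimensional problem concrete, the sketch below finds the minimum bottleneck for $m$ modules on $M$ processors. It is not the $O(m + (M \log m)^2)$ algorithm summarized above; it is a simpler, well-known substitute (binary search over the bottleneck value with a greedy feasibility probe) that assumes non-negative integer weights. The function names are hypothetical.

```python
def probe(weights, M, B):
    """Can the chain be cut into at most M contiguous pieces, each of load <= B?"""
    pieces, load = 1, 0
    for w in weights:
        if w > B:
            return False            # a single module already exceeds the bound
        if load + w > B:
            pieces += 1             # start a new piece at this module
            load = w
        else:
            load += w
    return pieces <= M

def min_bottleneck(weights, M):
    """Smallest achievable maximum piece load (non-empty list of integer weights)."""
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(weights, M, mid):
            hi = mid                # feasible: try a smaller bottleneck
        else:
            lo = mid + 1            # infeasible: the bottleneck must be larger
    return lo
```

Each probe costs $O(m)$, so this sketch runs in $O(m \log W)$ time, where $W$ is the total weight; its running time therefore depends on the magnitude of the weights, unlike the bound quoted above.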